In data science, EDA (Exploratory Data Analysis) is the crucial first step of analyzing datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions, primarily using visualizations and summary statistics, to understand the data’s core characteristics before formal modeling. It’s the detective work that helps data scientists grasp the data’s structure, identify errors, and decide how to best manipulate it for accurate insights, guiding feature selection and model building.
Question 1: Plot waveforms for each activity class
Plot the waveform for one sample data from each activity class. Analyze differences/similarities between the activities.
Code
```python
import sys
sys.path.append('..')
from Assignment.HAR.MakeDataset import X_train, y_train
# import os
import matplotlib.pyplot as plt
import numpy as np

Activity_Classes = {1: 'WALKING', 2: 'WALKING_UPSTAIRS', 3: 'WALKING_DOWNSTAIRS',
                    4: 'SITTING', 5: 'STANDING', 6: 'LAYING'}

# X_train is a 3D array: (# samples = (# activities x # subjects), # timesteps, # features = (3: accx, accy, accz))
# We have 21 subjects for train data and 9 subjects for test data, and 6 activities,
# so X_train.shape = (21*6, 500, 3)
# print(X_train.shape)
# open('X_train.txt', 'w').write(str(X_train))
```
Training data shape: (126, 500, 3)
Testing data shape: (54, 500, 3)
We will plot the acceleration data on the y-axis against time steps on the x-axis.

The data is shuffled by train_test_split, so we need to use y_train to locate a sample for each activity.
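As a minimal sketch of that lookup (using a small hypothetical label array in place of the real y_train), the first sample of each activity can be found with np.where:

```python
import numpy as np

# Hypothetical shuffled label array standing in for y_train
y_train = np.array([3, 1, 2, 1, 6, 4, 5, 2, 6])

# For each activity id 1..6, take the index of its first occurrence
sample_idx = {act: int(np.where(y_train == act)[0][0]) for act in range(1, 7)}
print(sample_idx)
```

With the real y_train, `X_train[sample_idx[act]]` then gives one (500, 3) recording per activity to plot.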
Differences:

- Static activities (SITTING, STANDING, LAYING): relatively flat, stable signals with minimal fluctuation
- Dynamic activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS): periodic oscillating patterns due to repetitive motion

Similarities:

- All static activities have similar low-variance patterns, making them harder to distinguish from each other
- The walking activities all show periodic patterns, but with different frequencies and amplitudes

Can the model classify activities?

- Yes, the model should distinguish static from dynamic activities easily, due to the clear differences in variance/amplitude
- Distinguishing between the walking activities (upstairs vs. downstairs) may be more challenging, as their patterns are similar
- Distinguishing between the static activities (sitting vs. standing vs. laying) may also be challenging
Question 2: Do you think we need a machine learning model to differentiate between static activities and dynamic activities?
No, we don’t need a machine learning model to differentiate between static and dynamic activities.
Justification:

- For static activities, the total acceleration magnitude \(\sqrt{acc_x^2 + acc_y^2 + acc_z^2} \approx 1g\) and remains nearly constant (low variance), because at rest the accelerometer measures only the acceleration due to gravity.
- For dynamic activities, the magnitude oscillates significantly (high variance).
- A simple threshold-based rule on the variance or standard deviation can classify reliably: if `std(total_acc) > threshold`, predict Dynamic; else Static.
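A minimal sketch of that rule (the 0.5 threshold and the synthetic windows below are illustrative assumptions, not values taken from this dataset):

```python
import numpy as np

def classify_static_dynamic(window, threshold=0.5):
    """window: (timesteps, 3) accelerometer sample. Returns 'Dynamic' or 'Static'."""
    total_acc = np.sqrt((window ** 2).sum(axis=1))  # magnitude at each timestep
    return 'Dynamic' if total_acc.std() > threshold else 'Static'

t = np.linspace(0, 10, 500)
# Stationary: gravity only (~1g on one axis), essentially constant magnitude
standing = np.column_stack([np.zeros(500), np.zeros(500), np.ones(500)])
# Walking-like: strong periodic oscillation on top of gravity
walking = standing + 2.0 * np.column_stack([np.sin(2 * np.pi * 2 * t)] * 3)

print(classify_static_dynamic(standing))  # Static
print(classify_static_dynamic(walking))   # Dynamic
```

The threshold would in practice be picked from the observed variance gap between the two groups.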
Question 3: Visualize the data using PCA
We use three methods for extracting important features: (1) PCA on the total acceleration (Tot_Acc), (2) TSFEL features followed by PCA, and (3) PCA on the features provided by the dataset authors themselves :)
\(acc_x^2 + acc_y^2 + acc_z^2 = \text{Signal Energy}\) (for the magnitude we take the square root)
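To make the distinction concrete (a toy reading, not data from this assignment):

```python
import numpy as np

acc = np.array([0.6, 0.0, 0.8])   # one accelerometer reading (accx, accy, accz)
energy = np.sum(acc ** 2)          # signal energy: accx^2 + accy^2 + accz^2
magnitude = np.sqrt(energy)        # total acceleration magnitude

print(energy, magnitude)  # 1.0 1.0 -> this reading is exactly 1g
```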
1. PCA on Tot_Acc
PCA on Total Acceleration: To compress the acceleration time series into 2 features (dimensions).
There is 1 data sample (one recording window) of a person doing a single activity. This single sample contains 500 time steps (dimensions).
If we compress these 500 time steps down to 2 Principal Components (coordinates) \(\rightarrow\) we get 2 numbers (\(x, y\)) that represent that entire recording.
This helps the human eye find patterns in a scatter plot.
Correction: It’s not that we had “500 dots” before. It is that we had 1 dot existing in a 500-dimensional invisible space. We physically cannot see 500 dimensions.
By compressing it to 2 dimensions, we can now project that 1 dot onto a flat 2D screen. This allows us to check if the “dots” for static activities (Sitting) are visibly separated from the “dots” for dynamic activities (Walking).
So this helps in EDA.
Method:

- Standardization (\(\mu=0, \sigma=1\))
- Compute the covariance matrix
- Find the eigenvalues and eigenvectors
- Sort them and select the top components
- Project the data onto the new subspace (transformation \(Y = XW\))
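The steps above can be sketched from scratch with NumPy (on random toy data, not the pipeline used below, which relies on sklearn's SVD-based PCA):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features

# 1. Standardization (mu=0, sigma=1)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix
C = np.cov(Xs, rowvar=False)

# 3. Eigenvalues and eigenvectors (eigh: C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Sort descending and select the top 2 components
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:2]]              # (5, 2) projection matrix

# 5. Project onto the new subspace: Y = XW
Y = Xs @ W
print(Y.shape)  # (100, 2)
```

As a sanity check, the variance of the first projected coordinate equals the largest eigenvalue.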
As we need a scatter plot of all activities in a shared coordinate system, we don't run PCA separately for each activity; instead we transform the whole dataset into one new 2-feature coordinate system.
Code
```python
# The covariance-matrix method above is not used in practice because computing the
# covariance matrix for large datasets is computationally expensive and may not be
# feasible for real-time applications.
# So we use sklearn, which internally uses SVD to compute the principal components
# efficiently, without the space and time complexity of the covariance matrix calculation.
from sklearn.decomposition import PCA

squared = X_train**2
X_energy = np.sum(squared, axis=2)  # axis=2 sums along the feature axis (accx, accy, accz)
# open('X_train_Tot_Acc.txt', 'w').write(str(X_train_Tot_Acc))
# print(X_train_Tot_Acc)
print(X_energy.shape)

pca = PCA(n_components=2)
X_reduced_m1 = pca.fit_transform(X_energy)
print(f"Shape of reduced data for plotting: {X_reduced_m1.shape}")
print(f"Variance explained by these 2 PCs: {pca.explained_variance_ratio_}")
```
(126, 500)
Shape of reduced data for plotting: (126, 2)
Variance explained by these 2 PCs: [0.10718737 0.08658314]
Scatter Plot
Code
```python
fig, axes = plt.subplots()
axis = axes
scatter = axis.scatter(X_reduced_m1[:, 0], X_reduced_m1[:, 1],
                       c=y_train, cmap='viridis', alpha=0.6, s=10)
handles, _ = scatter.legend_elements()
activity_labels = [Activity_Classes[i] for i in sorted(Activity_Classes.keys())]
legend1 = axis.legend(handles, activity_labels, title="Activities")
axis.add_artist(legend1)
axis.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)")
axis.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)")
axis.set_title("PCA on Total Acceleration (Signal Energy)")
plt.tight_layout()
plt.show()
```
2. Using TSFEL and then PCA
Instead of feeding raw numbers into PCA, we compute statistics (features) such as mean, variance, entropy, zero-crossing rate, and FFT peak. These are much better at distinguishing overlapping activities than the raw data (recall that STANDING, LAYING, and SITTING collapsed to almost a single point because their magnitudes all stay at ~1g).
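A couple of these statistics can be computed by hand to see why they separate activities better than raw samples do (toy signals below; TSFEL computes many more features than this):

```python
import numpy as np

t = np.linspace(0, 10, 500)
static_sig = np.ones(500)                      # ~1g, flat
dynamic_sig = 1 + np.sin(2 * np.pi * 2 * t)    # periodic, gait-like

def simple_features(sig):
    # crossings of the signal's own mean, a simple proxy for zero-crossing rate
    crossings = int(np.sum(np.diff(np.sign(sig - sig.mean())) != 0))
    return {'mean': sig.mean(), 'var': sig.var(), 'mean_crossings': crossings}

print(simple_features(static_sig))   # variance ~0, no crossings
print(simple_features(dynamic_sig))  # variance ~0.5, many crossings
```

Even though both signals share the same mean, the variance and crossing count separate them cleanly.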
Steps:
Data Serialization: TSFEL is designed for continuous streams and 2D space. We “unroll” our 3D dataset \((N, 500, 3)\) into a long 2D stream \((N \times 500, 3)\) and then use TSFEL’s windowing function to reconstruct the samples.
Feature Extraction: We extract statistical features independently for each axis (\(x, y, z\)), i.e. the 3 columns of the pandas DataFrame. TSFEL does this internally: it computes the mean, variance, etc. for each column and then concatenates the 3 per-axis feature sets into one wide row.
Standardization (Critical): We mix features with vastly different physical units (e.g., Variance in \(m^2/s^4\) vs. Mean in \(m/s^2\)). Without StandardScaler, PCA would be biased towards features with larger numerical magnitudes, ignoring potentially discriminative but smaller-valued features.
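The scaling step can be sketched to show exactly the bias it prevents (synthetic two-feature data; the real TSFEL output has hundreds of columns):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Feature 0: tiny numerical scale; feature 1: huge numerical scale
f_small = rng.normal(scale=0.01, size=200)
f_large = rng.normal(scale=100.0, size=200)
X = np.column_stack([f_small, f_large])

# Without scaling, PC1 is dominated by the large-magnitude feature
pc1_raw = np.abs(PCA(n_components=1).fit(X).components_[0])
# With scaling, both features contribute comparably
pc1_scaled = np.abs(PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0])

print(pc1_raw)     # weight almost entirely on feature 1
print(pc1_scaled)  # weights roughly balanced
```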
Code
```python
import tsfel
import pandas as pd
from sklearn.preprocessing import StandardScaler

# prepare data: convert (N, 500, 3) -> (N*500, 3)
# This creates one continuous stream of data.
X_flat = X_train.reshape(-1, 3)
df_stream = pd.DataFrame(X_flat, columns=['acc_x', 'acc_y', 'acc_z'])

"""
tsfel config - define which features to extract.
- 'statistical' includes mean, var, kurtosis, etc.
- 'temporal' includes slope, zero-crossing, etc.
"""
cfg = tsfel.get_features_by_domain('statistical')  # 'temporal' and 'spectral' need more CPU and time which I don't have

# feature extraction
# window_size=500 and overlap=0 to exactly reconstruct our original samples.
# fs=50 -> sampling frequency (50 Hz).
X_tsfel = tsfel.time_series_features_extractor(
    cfg, df_stream, fs=50, window_size=500, overlap=0, verbose=0
)
print(f"Feature extraction complete. New shape: {X_tsfel.shape}")

# check for NaNs (sometimes features like FFT fail on constant data)
X_tsfel = X_tsfel.fillna(0)

# standardize
scaler = StandardScaler()
X_tsfel_scaled = scaler.fit_transform(X_tsfel)
```
Code
```python
# print(X_tsfel.columns)  # print the names of the extracted columns/features
```
3. PCA on the features provided by the dataset authors (X_train.txt)
Code
```python
# path = '../human+activity+recognition+using+smartphones/UCI HAR Dataset/features.txt'
# with open(path, 'r') as f:
#     data = f.readlines()
pathX = "../human+activity+recognition+using+smartphones/UCI HAR Dataset/train/X_train.txt"
pathY = "../human+activity+recognition+using+smartphones/UCI HAR Dataset/train/y_train.txt"
dataset_X_train = np.loadtxt(pathX)
dataset_y_train = np.loadtxt(pathY)
print(dataset_X_train.shape, dataset_y_train.shape)
# 561 columns -> 561 features; each column holds that feature's value for all samples.
# dataset_y_train has one integer per row, from 1 to 6, denoting the activity class
# the sample belongs to. So now you know what to do :)

pca = PCA(n_components=2)
# dataset_X_train is already in (n_samples, n_features) format (a 2D array), so apply PCA directly
X_reduced_m3 = pca.fit_transform(dataset_X_train)
print("X_reduced shape", X_reduced_m3.shape)

fig, axes = plt.subplots()
axis = axes
scatter = axis.scatter(X_reduced_m3[:, 0], X_reduced_m3[:, 1],
                       c=dataset_y_train, cmap='viridis', alpha=0.6, s=10)
handles, _ = scatter.legend_elements()
activity_labels = [Activity_Classes[i] for i in sorted(Activity_Classes.keys())]
legend1 = axis.legend(handles, activity_labels, title="Activities", loc='lower left')
axis.add_artist(legend1)
axis.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)")
axis.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)")
axis.set_title("PCA on All 561 Features")
plt.tight_layout()
plt.show()
```
(7352, 561) (7352,)
X_reduced shape (7352, 2)
We can see that static and dynamic activities are separated much more effectively. Notice also that the three activities within the dynamic class are separated nicely. Next we analyse which features contributed most to PC1 and PC2, i.e. which features played the major role in distinguishing the classes so precisely.
Code
```python
path = '../human+activity+recognition+using+smartphones/UCI HAR Dataset/features.txt'
features_df = pd.read_csv(path, sep=r'\s+', header=None, names=['idx', 'name'])
feature_names = features_df['name'].values
print(f"Loaded {len(feature_names)} feature names.")
print(pca.components_.shape)

# analysing the weights of the features in the PCs
# pca.components_ -> (n_components, n_features) -> (2, 561)
# row 0 = weights for PC1, row 1 = weights for PC2
print("\n--- PC1 Interpretation ---")
# absolute value of the weights; the sign (direction) is not needed
pc1_weights = np.abs(pca.components_[0])
# trick: argsort gives indices of sorted values, [-5:] takes the top 5, [::-1] reverses to descending
top_5_pc1_idx = pc1_weights.argsort()[-5:][::-1]
for idx in top_5_pc1_idx:
    print(f"Feature '{feature_names[idx]}' weight: {pc1_weights[idx]:.4f}")

print("\n--- PC2 Interpretation ---")
pc2_weights = np.abs(pca.components_[1])
top_5_pc2_indices = pc2_weights.argsort()[-5:][::-1]
for idx in top_5_pc2_indices:
    print(f"Feature '{feature_names[idx]}' weight: {pc2_weights[idx]:.4f}")
```
- The 1st method (PCA on the raw signal energy) distinguishes static from dynamic activity pretty well, but does not separate the sub-activities within each class.
- The 2nd method (PCA on TSFEL features) distinguishes static from dynamic classes and also separates the static sub-activities, but the three dynamic activities are pretty much stacked together. It successfully isolates "Laying" from the other static activities (the "Mean" feature captures the orientation change), though it struggles to distinguish "Sitting" from "Standing", whose statistical profiles are nearly identical.
- The 3rd method (PCA on the dataset providers' features) distinguishes static from dynamic classes and also separates the dynamic sub-activities, but the three static activities are stacked together.

In the third method we saw that:

- PC1 was heavily affected by Jerk and Entropy features, so it separated the static and dynamic classes.
- PC2 was heavily affected by gyro and frequency features, so it separated the dynamic sub-classes very well; walking upstairs and downstairs have different gyroscope signatures.

So even though the 3rd method might be expected to give better static sub-class separation, TSFEL was better at distinguishing the static sub-classes (that's what the data shows, at least :)).
Question 4: Correlation Matrix for features obtained by TSFEL and from Dataset. Identify the features that are highly correlated with each other. Are there any redundant features?
- Dataset of TSFEL features: X_tsfel
- Dataset of the dataset providers' features: dataset_X_train
- We call a pair of features highly correlated when the correlation is between 0.85 and 0.95.
- We call a feature redundant when the correlation exceeds 0.95: such features provide essentially the same information about the data, so keeping both adds no significant value and may lead to overfitting in machine learning models.
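Under that definition, dropping redundant features can be sketched with pandas (a hypothetical 3-column frame below; the real analysis runs on X_tsfel and dataset_X_train):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.normal(size=300)
df = pd.DataFrame({
    'feat_a': a,
    'feat_b': a + rng.normal(scale=0.01, size=300),  # near-duplicate of feat_a (corr > 0.95)
    'feat_c': rng.normal(size=300),                  # independent feature
})

corr = df.corr().abs()
# keep only the upper triangle (k=1 excludes the self-correlation diagonal)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(redundant)  # ['feat_b']
df_reduced = df.drop(columns=redundant)
print(df_reduced.columns.tolist())  # ['feat_a', 'feat_c']
```

For each redundant pair this keeps the first feature and drops its near-duplicate.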
```python
# Using a heatmap is impractical here because the correlation matrix is large,
# so we need a programmatic way to find the highly correlated pairs.
# import seaborn as sns
# sns.heatmap(corr_matrix_1, cmap='rocket')
# sns.heatmap(corr_matrix_2, cmap='rocket')
```
Code
```python
# Upper-triangle matrix, zero on and below the main diagonal
# (k=1 zeroes the diagonal too, since self-correlation = 1 and we ignore it)
temp = np.triu(corr_matrix_1, k=1)
# indices of pairs in the "highly correlated" band
india = np.where((temp >= 0.85) & (temp <= 0.95))
print(india[0].shape)

high_corr_1 = np.zeros((india[0].shape[0], 3))
for i in range(india[0].shape[0]):
    high_corr_1[i] = [india[0][i], india[1][i], temp[india[0][i], india[1][i]]]

# sort by correlation value, descending
sorted_indices = np.argsort(high_corr_1[:, 2])[::-1]
high_corr_1 = high_corr_1[sorted_indices]

print("Top 10 highly correlated feature pairs in TSFEL features:")
for i in range(10):
    print(f"{high_corr_1[i,2]:.4f} : [{X_tsfel.columns[int(high_corr_1[i,0])]} , {X_tsfel.columns[int(high_corr_1[i,1])]}]")
```